Abstract


We propose a novel method for speeding up stochastic optimization algorithms
via sketching methods, which recently became a powerful tool for accelerating
algorithms for numerical linear algebra. We revisit the method of conditioning for
accelerating first-order methods and suggest the use of sketching methods for
constructing a cheap conditioner that attains a significant speedup with respect to the
Stochastic Gradient Descent (SGD) algorithm. While our theoretical guarantees
assume convexity, we discuss the applicability of our method to deep neural
networks, and experimentally demonstrate its merits.


1 Introduction 


We consider empirical loss minimization problems of the form:

$$\min_{W \in \mathbb{R}^{p \times n}} L(W) := \frac{1}{m} \sum_{i=1}^{m} \ell_{y_i}(W x_i) , \qquad (1)$$

where for every $i$, $x_i$ is an $n$-dimensional vector and $\ell_{y_i}$ is a loss function from $\mathbb{R}^p$ to $\mathbb{R}$.
For example, in multiclass categorization with the logistic loss, we have that
$\ell_{y_i}(a) = \log\left(\sum_{j=1}^{p} \exp(a_j - a_{y_i})\right)$. Later in the paper we will generalize the discussion to the
case in which $W$ is the weight matrix of an intermediate layer of a deep neural network.

We consider the large data regime, in which $m$ is large. A popular algorithm for
this case is Stochastic Gradient Descent (SGD). The basic idea is to initialize $W_1$ to be
some matrix, and at each time $t$ to draw an index $i \in \{1, \ldots, m\}$ uniformly at random
from the training sequence $S = ((x_1, y_1), \ldots, (x_m, y_m))$, and then update $W_{t+1}$ based
on the gradient of $\ell_{y_i}(W x_i)$ at $W_t$. When performing this update we would like to
decrease the value of $\ell_{y_i}(W x_i)$ while bearing in mind that we only look at a single
example, and therefore we should not change $W$ too much. This can be formalized by
an update of the form

$$W_{t+1} = \operatorname{argmin}_{W} \; \frac{1}{2\eta} D(W, W_t) + \ell_{y_i}(W x_i) ,$$

where $D(\cdot,\cdot)$ is some distance measure between matrices and $\eta$, the learning rate,
controls the tradeoff between the desire to minimize the function and the desire to stay
close to $W_t$. Since we keep $W$ close to $W_t$, we can further simplify things by using the
first-order approximation of $\ell_{y_i}$ around $W_t x_i$,

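$$\ell_{y_i}(W x_i) \approx \ell_{y_i}(W_t x_i) + \langle W - W_t, (\nabla \ell_{y_i}(W_t x_i)) x_i^\top \rangle ,$$

where $\nabla \ell_{y_i}(W_t x_i) \in \mathbb{R}^p$ is the (sub)gradient of $\ell_{y_i}$ at the $p$-dimensional vector $W_t x_i$
(as a column vector), and for two matrices $A, B$ we use the notation $\langle A, B \rangle = \sum_{i,j} A_{i,j} B_{i,j}$.
Hence, the update becomes

$$W_{t+1} = \operatorname{argmin}_{W} \; \frac{1}{2\eta} D(W, W_t) + \ell_{y_i}(W_t x_i) + \langle W - W_t, (\nabla \ell_{y_i}(W_t x_i)) x_i^\top \rangle . \qquad (2)$$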

Equation (2) defines a family of algorithms, where different instances are derived
by specifying the distance measure $D$. The simplest choice of $D$ is the squared Frobenius
norm regularizer, namely,

$$D(W, W_t) = \|W - W_t\|_F^2 = \langle W - W_t, W - W_t \rangle .$$

It is easy to verify that for this choice of $D$, the update given in Equation (2) becomes

$$W_{t+1} = W_t - \eta (\nabla \ell_{y_i}(W_t x_i)) x_i^\top ,$$

which is exactly the update rule of SGD. Note that the Frobenius norm distance measure
can be rewritten as

$$D(W, W_t) = \langle I, (W - W_t)^\top (W - W_t) \rangle .$$

In this paper, we consider the family of distance measures of the form

$$D_A(W, W_t) = \langle A, (W - W_t)^\top (W - W_t) \rangle ,$$

where $A$ is a positive definite matrix. For every such choice of $A$, the update given in
Equation (2) becomes

$$W_{t+1} = W_t - \eta (\nabla \ell_{y_i}(W_t x_i)) (A^{-1} x_i)^\top . \qquad (3)$$

We refer to the matrix A as a conditioning matrix (for a reason that will become clear 
shortly) and call the resulting algorithm Conditioned SGD. 
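
The update in Equation (3) can be illustrated with a minimal NumPy sketch for the multiclass logistic loss. The function names `logistic_grad` and `conditioned_sgd_step`, the toy dimensions, and the step size are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def logistic_grad(a, y):
    """Gradient of log(sum_j exp(a_j - a_y)) with respect to the score vector a."""
    z = a - a.max()                      # shift scores for numerical stability
    s = np.exp(z) / np.exp(z).sum()      # softmax probabilities
    s[y] -= 1.0                          # softmax minus the indicator of the true class
    return s

def conditioned_sgd_step(W, x, y, A_inv, eta):
    """One step of Equation (3): W <- W - eta * grad * (A^{-1} x)^T."""
    g = logistic_grad(W @ x, y)          # p-dimensional (sub)gradient at W x
    return W - eta * np.outer(g, A_inv @ x)

# Toy usage (hypothetical sizes): with A = I the step reduces to vanilla SGD.
rng = np.random.default_rng(0)
p, n = 3, 5
W = np.zeros((p, n))
x, y = rng.standard_normal(n), 1
W_sgd = conditioned_sgd_step(W, x, y, np.eye(n), eta=0.1)
```

Passing the inverse $A^{-1}$ directly, rather than $A$, reflects the fact that only the product $A^{-1} x_i$ is needed at each iteration.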

How should we choose the conditioning matrix $A$? There are two considerations.

First, we would like to choose $A$ so that the algorithm will converge to a solution of
Equation (1) as fast as possible. Second, we would like it to be easy to compute
both the matrix $A$ and the update rule given in Equation (3).

We start with the first consideration. Naturally, the convergence of the Conditioned
SGD algorithm depends on the specific problem at hand. However, we can rely on
convergence bounds and pick the $A$ that minimizes these bounds. Concretely, assuming
that each $\ell_{y_i}$ is convex and $\rho$-Lipschitz, denote by $C = \frac{1}{m} \sum_{i=1}^{m} x_i x_i^\top$ the correlation
matrix of the data, and let $W^*$ be an optimum of Equation (1). Then the sub-optimality
of the Conditioned SGD algorithm after performing $T$ iterations is upper bounded by

$$\frac{1}{2 \eta T} D_A(W^*, W_1) + \frac{\eta \rho^2}{2} \operatorname{tr}(A^{-1} C) .$$

We still cannot minimize this bound w.r.t. $A$ as we do not know the value of $W^*$. So,
we further upper bound $D_A(W^*, W_1)$ by considering two possible bounds. Denoting
the spectral norm and the trace norm by $\|\cdot\|_{sp}$ and $\|\cdot\|_{tr}$, respectively, we have

1. $D_A(W^*, W_1) \le \|A\|_{sp} \, \|(W^* - W_1)^\top (W^* - W_1)\|_{tr}$

2. $D_A(W^*, W_1) \le \|A\|_{tr} \, \|(W^* - W_1)^\top (W^* - W_1)\|_{sp}$

Interestingly, for the first possibility above, the optimal $A$ becomes $A = I$, corresponding
to the vanilla SGD algorithm. However, for the second possibility, we show that
the optimal $A$ becomes $A = C^{1/2}$. The ratio between the required number of iterations
to achieve $\epsilon$ sub-optimality is

$$\frac{\#\text{ iterations for } A = I}{\#\text{ iterations for } A = C^{1/2}} = \frac{\|(W^* - W_1)^\top (W^* - W_1)\|_{tr} \, \|C\|_{tr}}{\|(W^* - W_1)^\top (W^* - W_1)\|_{sp} \, \|C^{1/2}\|_{tr}^2} .$$

The above ratio is always between $1/n$ and $\min\{n, p\}$. We argue that in many typical
cases the ratio will be greater than 1, meaning that the conditioner $A = C^{1/2}$ will
lead to faster convergence. For example, suppose that the norm of each row of $W^*$ is
of order 1, but the rows are not correlated. Let us also choose $W_1 = 0$ and assume
that $p = \Theta(n)$. Then, $\|(W^* - W_1)^\top (W^* - W_1)\|_{tr} / \|(W^* - W_1)^\top (W^* - W_1)\|_{sp}$ is of order $n$. On the other hand, if the
eigenvalues of $C$ decay fast, then $\|C\|_{tr} / \|C^{1/2}\|_{tr}^2 \approx 1$. Therefore, in such scenarios, using
the conditioner $A = C^{1/2}$ will lead to a factor of $n$ fewer iterations relative to vanilla
SGD.

Getting back to the question of how to choose $A$, the second consideration that we
have mentioned is the time required to compute $A^{-1}$ and to apply the update rule given
in Equation (3). As we will show later, the time required to compute $A^{-1}$ is less of an
issue relative to the time of applying the update rule at each iteration, so we focus on
the latter.

Observe that the time required to apply Equation (3) is of order $(p+n)n$. Therefore,
if $p \approx n$ then we have no significant overhead in applying the conditioner relative
to applying vanilla SGD. If $p \ll n$, then the update time is dominated by the time
required to compute $A^{-1} x_i$. To decrease this time, we propose to use $A$ of the form
$Q B Q^\top + a(I - Q Q^\top)$, where $Q \in \mathbb{R}^{n \times k}$ has orthonormal columns, $B \in \mathbb{R}^{k \times k}$ is
invertible and $k \ll n$. We use linear sketching techniques (see [?]) for constructing
this conditioner efficiently, and therefore we refer to the resulting algorithm as Sketched
Conditioned SGD (SCSGD). Intuitively, the sketched conditioner is a combination of
the two conditioners $A = I$ and $A = C^{1/2}$, where the matrix $Q B Q^\top$ captures the
top eigenvalues of $C$ and the matrix $a(I - Q Q^\top)$ deals with the smaller eigenvalues
of $C$. We show that if the eigenvalues of $C$ decay fast enough then SCSGD enjoys
a speedup similar to that of the full conditioner $A = C^{1/2}$. The advantage of using the sketched
conditioner is that the time required to apply Equation (3) becomes $(p+k)n$. Therefore,
if $p \ge k$ then the runtime per iteration of SCSGD and the runtime per iteration of
vanilla SGD are of the same order of magnitude.

The rest of the paper is organized as follows. In the next subsection we survey some
related work. In Section 2 we describe in detail our conditioning method. Finally,
in Section 3 we discuss variants of the method that are applicable to deep learning
problems and report some preliminary experiments showing the merits of conditioning
for deep learning problems. Due to the lack of space, proofs are omitted and can be
found in Appendix A.

1.1 Related work 

Conditioning is a well established technique in optimization aiming at choosing an
“appropriate” coordinate system for the optimization process. For twice differentiable
objectives, maybe the most well known approach is Newton’s method, which dynamically
changes the coordinate system according to the Hessian of the objective around
the current solution. There are several problems with utilizing the Hessian. First, in our
case, the Hessian matrix is of size $(pn) \times (pn)$. Hence, it is computationally expensive
to compute and invert it. Second, even for convex problems, the Hessian matrix might
be meaningless. For example, for linear regression with the absolute loss the Hessian
matrix is the zero matrix almost everywhere. Third, when the number of training examples
is very large, stochastic methods are preferable and it is not clear how to adapt
Newton’s method to the stochastic case. The crux of the problem is that while it is easy
to construct an unbiased, low-variance estimate of the gradient based on a single example,
it is not clear how to construct a good estimate of the Newton direction based
on a small mini-batch of examples.

Many approaches have been proposed for speeding up Newton’s method. For example,
the $R\{\cdot\}$ operator technique [14, 22, 10, ?]. However, these methods are not applicable
to the stochastic setting. An obvious way to decrease the storage and computational
cost is to only consider the diagonal elements of the Hessian (see [?]). Schraudolph
[?] proposed an adaptation of the L-BFGS approach to the online setting, in
which at each iteration, the estimate of the inverse of the Hessian is computed based
on only the last few noisy gradients. Naturally, this yields a low rank approximation. In
[?], the two aforementioned approaches are combined to yield the SGD-QN algorithm.
In the same paper, an analysis of second order SGD is described, but with $A$ being
always the Hessian matrix at the optimum (which is of course not known). There are
various other approximations, see for example [?].

To tackle the second problem, several methods [?] rely on different variants
of the Gauss-Newton approximation of the Hessian. A somewhat related approach
is Amari’s natural gradient descent [?]. See the discussion in [13]. To the best of our
knowledge, these methods come with no theoretical guarantees.

The aforementioned approaches change the conditioner at each iteration of the
algorithm. A general treatment of this approach is described in [11, Section 1.3.1] under
the name “Variable Metric”. Maybe the most relevant approach is the AdaGrad
algorithm [6], which was originally proposed for the online learning setting but can be
easily adapted to the stochastic optimization setting. In our notation, the AdaGrad
algorithm uses a $(pn) \times (pn)$ conditioning matrix that changes along time and has the
form $A_t = \delta I + \sum_{i=1}^{t} \nabla_i \nabla_i^\top$, where $\nabla_t = \operatorname{vec}((\nabla \ell_{y_i}(W_t x_i)) x_i^\top)$. There are several
advantages of our method relative to AdaGrad. First, the convergence bound we obtain
is better than the convergence bound of AdaGrad. Specifically, while both bounds
have the same dependence on $\|C^{1/2}\|_{tr}$, our bound depends on $\|W^*\|_{sp}$ while AdaGrad
depends on $\|W^*\|_F$. As we discussed before, there may be a gap of $p$ between $\|W^*\|_{sp}$
and $\|W^*\|_F$. More critically, when using a full conditioner, the storage complexity of
our conditioner is $n^2$, while the storage complexity of AdaGrad is $(np)^2$. In addition,
the time complexity of applying the update rule is $(p+n)n$ for our conditioner versus
$(np)^2$ for AdaGrad. For this reason, most practical applications of AdaGrad rely on
a diagonal approximation of $A_t$. In contrast, we can use a full conditioner in many
practical cases, and even when $n$ is large our sketched conditioner can be applied without
a significant increase in the complexity relative to vanilla SGD. Finally, because
we derive our algorithm for the stochastic case (as opposed to the adversarial online
optimization setting), and because we bound the component $\nabla \ell_y(W x)$ using the
Lipschitzness of $\ell_y$, the conditioner we use is the constant $C^{-1/2}$ along the entire run of
the optimization process, and should only be calculated once. In contrast, AdaGrad
replaces the conditioner in every iteration.

2 Conditioning and Sketched Conditioning 

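As mentioned previously, the algorithms we consider start with an initial matrix $W_1$
and at each iteration update the matrix according to Equation (3). The following lemma
provides an upper bound on the expected sub-optimality of any algorithm of this form.

Lemma 1. Fix a positive definite matrix $A \in \mathbb{R}^{n \times n}$. Let $W^*$ be the minimizer of
Equation (1), let $\sigma \in \mathbb{R}$ be such that $\sigma \ge \|W^*\|_{sp}$ and denote $C = \frac{1}{m} \sum_{i=1}^{m} x_i x_i^\top$.
Assume that for every $i$, $\ell_{y_i}$ is convex and $\rho$-Lipschitz. Then, if we apply the update
rule given in Equation (3) using the conditioner $A$ and denote $\bar{W} = \frac{1}{T} \sum_{t=1}^{T} W_t$, then

$$\mathbb{E}[L(\bar{W}) - L(W^*)] \le \frac{1}{2 \eta T} \operatorname{tr}(A W^{*\top} W^*) + \frac{\eta \rho^2}{2} \operatorname{tr}(A^{-1} C) .$$

In particular, for $\eta = \sigma / (\rho \sqrt{T})$, we obtain

$$\mathbb{E}[L(\bar{W}) - L(W^*)] \le \frac{\sigma \rho}{2 \sqrt{T}} \left( \operatorname{tr}(A) + \operatorname{tr}(A^{-1} C) \right) .$$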

The proof of the above lemma can be obtained by replacing the standard inner
product with the inner product induced by $A$. For completeness, we provide a proof in
Appendix A.

In Appendix A we show that the conditioner which minimizes the bound given in
the above lemma is $A = C^{1/2}$. This yields:

Theorem 1. Following the notation of Lemma 1, assume that we run the meta-algorithm
with $A = C^{1/2}$. Then

$$\mathbb{E}[L(\bar{W}) - L(W^*)] \le \frac{\sigma \rho}{\sqrt{T}} \operatorname{tr}(C^{1/2}) .$$
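
As a numerical sanity check of the choice $A = C^{1/2}$, the following NumPy sketch evaluates the $A$-dependent factor $\operatorname{tr}(A) + \operatorname{tr}(A^{-1} C)$ of Lemma 1 on synthetic data. The data model (independent Gaussian coordinates with geometrically decaying scales), the helper names `bound_term` and `sqrtm_psd`, and the comparison against the best scaled identity $\tau I$ are assumptions chosen for illustration.

```python
import numpy as np

def bound_term(A, C):
    """tr(A) + tr(A^{-1} C), the A-dependent factor in the bound of Lemma 1."""
    return np.trace(A) + np.trace(np.linalg.solve(A, C))

def sqrtm_psd(C):
    """Symmetric square root of a positive semidefinite matrix via eigh."""
    lam, U = np.linalg.eigh(C)
    return (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T

rng = np.random.default_rng(0)
n, m = 10, 500
scales = 2.0 ** -np.arange(n)                      # fast-decaying spectrum (assumed)
X = rng.standard_normal((n, m)) * scales[:, None]  # columns are the examples x_i
C = X @ X.T / m                                    # correlation matrix

A_opt = sqrtm_psd(C)
value_opt = bound_term(A_opt, C)                   # equals 2 * tr(C^{1/2})
best_identity = 2.0 * np.sqrt(n * np.trace(C))     # min over tau of bound_term(tau * I, C)
print(value_opt, best_identity)
```

The gap between the two printed values grows as the spectrum of $C$ decays faster, which matches the discussion of the iteration ratio in the introduction.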


2.1 Sketched Conditioning 

Let $k < n$ and assume that $\operatorname{rank}(C) \ge k$. Consider the following family of conditioners:

$$\mathcal{A} = \{ A = Q B Q^\top + a(I - Q Q^\top) : Q \in \mathbb{R}^{n \times k}, \; Q^\top Q = I, \; B \succ 0 \in \mathbb{R}^{k \times k}, \; a > 0 \} \qquad (4)$$

Before proceeding, we show that the conditioners in $\mathcal{A}$ are indeed positive definite, and
give a formula for their inverse.

Lemma 2. Let $A = Q B Q^\top + a(I - Q Q^\top) \in \mathcal{A}$. Then, $A \succ 0$ and its inverse is given
by

$$A^{-1} = Q B^{-1} Q^\top + a^{-1}(I - Q Q^\top) .$$

Informally, every conditioner $A \in \mathcal{A}$ is a combination of a low rank conditioner
and the identity conditioner. The most appealing property of these conditioners is that
we can compute $A^{-1} x$ in time $O(nk)$ and therefore the time complexity of calculating
the update given in Equation (3) is $O(n(p + k))$.
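
The $O(nk)$ claim can be made concrete with a short NumPy sketch that applies the inverse formula of Lemma 2 without ever forming the $n \times n$ matrix. The function name `apply_sketched_inverse` and the random test instance are assumptions for illustration; the dense solve is included only to check the formula.

```python
import numpy as np

def apply_sketched_inverse(Q, B_inv, a_inv, x):
    """Return A^{-1} x for A = Q B Q^T + a (I - Q Q^T) without forming A."""
    c = Q.T @ x                                    # k coefficients, O(nk)
    return Q @ (B_inv @ c) + a_inv * (x - Q @ c)   # O(nk + k^2) overall

rng = np.random.default_rng(0)
n, k, a = 50, 4, 0.3
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # n x k with orthonormal columns
E = rng.standard_normal((k, k))
B = E @ E.T + np.eye(k)                            # positive definite k x k block
A = Q @ B @ Q.T + a * (np.eye(n) - Q @ Q.T)        # dense conditioner, for checking only
x = rng.standard_normal(n)

fast = apply_sketched_inverse(Q, np.linalg.inv(B), 1.0 / a, x)
dense = np.linalg.solve(A, x)
```

Only $Q$, $B^{-1}$ and $a^{-1}$ need to be stored, which is $O(nk + k^2)$ memory instead of $O(n^2)$.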

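In the next subsections we focus on instances of $\mathcal{A}$ which are induced by an
approximate best rank-$k$ approximation of $C$. However, for now, we give an analysis for
any choice of $A \in \mathcal{A}$.

Theorem 2. Following the notation of Lemma 1, let $A \in \mathcal{A}$ and denote $\tilde{C} = Q^\top C Q$.
Then, if $a = \sqrt{(\operatorname{tr}(C) - \operatorname{tr}(\tilde{C})) / (n - k)}$, we have

$$\mathbb{E}[L(\bar{W}) - L(W^*)] \le \frac{\sigma \rho}{2 \sqrt{T}} \cdot \left( \operatorname{tr}(B) + \operatorname{tr}(B^{-1} \tilde{C}) + 2 \sqrt{(n - k)(\operatorname{tr}(C) - \operatorname{tr}(\tilde{C}))} \right) .$$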

2.2 Low-rank conditioning via exact low-rank approximation 

Maybe the most straightforward approach of defining $Q$ and $B$ is by taking the leading
eigenvectors of $C$. Concretely, let $C = U D U^\top$ be the eigenvalue decomposition of
$C$ and denote the diagonal elements of $D$ by $\lambda_1 \ge \ldots \ge \lambda_n \ge 0$. Recall that for
any $k \le n$, the best rank-$k$ approximation of $C$ is given by $C_k = U_k D_k U_k^\top$, where
$U_k \in \mathbb{R}^{n \times k}$ consists of the first $k$ columns of $U$ and $D_k$ is the first $k \times k$ sub-matrix
of $D$. Denote $\tilde{C} = Q^\top C Q$ and consider the conditioner $\tilde{A}$ which is determined from
Equation (4) by setting $Q = U_k$, $B = \tilde{C}^{1/2}$ and $a$ as in Theorem 2.

Theorem 3. Let $Q = U_k$, $B = \tilde{C}^{1/2}$ and $a$ as in Theorem 2 and consider the conditioner
given in Equation (4). Then,

$$\mathbb{E}[L(\bar{W}) - L(W^*)] \le \frac{\sigma \rho}{\sqrt{T}} \cdot \left( \operatorname{tr}(C_k^{1/2}) + \sqrt{(n - k)(\operatorname{tr}(C) - \operatorname{tr}(C_k))} \right) .$$
Figure 1


In particular, if $\sqrt{(n - k)(\operatorname{tr}(C) - \operatorname{tr}(C_k))} = O(\operatorname{tr}(C^{1/2}))$, then the obtained bound
is of the same order as the bound in Theorem 1.

We refer to the condition $\sqrt{(n - k)(\operatorname{tr}(C) - \operatorname{tr}(C_k))} = O(\operatorname{tr}(C^{1/2}))$ as a fast
spectral decay property of the matrix $C$.
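
To get a feeling for the fast spectral decay property, the sketch below computes the ratio $\sqrt{(n-k)(\operatorname{tr}(C) - \operatorname{tr}(C_k))} / \operatorname{tr}(C^{1/2})$ directly from a list of eigenvalues. The function name `spectral_decay_ratio` and the two example spectra (geometric decay and a flat spectrum) are assumptions for illustration; a small ratio indicates that the property holds with a small constant.

```python
import numpy as np

def spectral_decay_ratio(eigvals, k):
    """sqrt((n - k) * (tr(C) - tr(C_k))) / tr(C^{1/2}) from the eigenvalues of C."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # decreasing order
    n = lam.size
    tail = lam[k:].sum()                                   # tr(C) - tr(C_k)
    return np.sqrt((n - k) * tail) / np.sqrt(lam).sum()

fast = spectral_decay_ratio(2.0 ** -np.arange(100), k=10)  # geometric decay
flat = spectral_decay_ratio(np.ones(100), k=10)            # no decay at all
print(fast, flat)
```

With geometric decay the tail term is negligible next to $\operatorname{tr}(C^{1/2})$, whereas for a flat spectrum the two quantities are of the same size, so a rank-$k$ sketch captures little.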

2.3 Low-rank conditioning via sketching 

The conditioner defined in the previous subsection requires the exact calculation of the
matrix $C$ and its eigenvalue decomposition. In this section we describe a faster technique
for calculating a sketched conditioner. Before formally describing the sketching
technique, let us try to explain the intuition behind it. Figure 1 depicts a set of 1000
(blue) random points in the plane. Suppose that we represent this sequence by a matrix
$X \in \mathbb{R}^{2 \times 1000}$, draw a vector $\omega \in \mathbb{R}^{1000}$ whose coordinates are $\mathcal{N}(0, 1)$
i.i.d. random variables and consider the vector $z = X \omega$. The vector $z$ is simply a
random combination of these points. As we can see, $z$ coincides with the strongest
direction of the data. More generally, the idea of sketching is that if we take a matrix
$X \in \mathbb{R}^{n \times m}$ and multiply it from the right by a random matrix $\Omega \in \mathbb{R}^{m \times r}$, then
with high probability, we preserve the strongest directions of the column space of $X$.
The above intuition is formalized by the following result, which follows from [?] by
setting $\epsilon = 1$.¹

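Lemma 3. Let $X \in \mathbb{R}^{n \times m}$. Let $r = O(k)$ and let $\Omega \in \mathbb{R}^{m \times r}$ be a random matrix
whose elements are i.i.d. $\mathcal{N}(0, 1/r)$ random variables. Let $P \in \mathbb{R}^{n \times r}$ be a matrix
whose columns form an orthonormal basis of the column space of $X \Omega$, let $U \in \mathbb{R}^{r \times k}$
be a matrix whose columns are the top $k$ eigenvectors of the matrix $(P^\top X)(P^\top X)^\top$,
and let $Q = P U \in \mathbb{R}^{n \times k}$. Then,

$$\mathbb{E} \|Q Q^\top X - X\|_F \le 2 \|X - X_k\|_F . \qquad (5)$$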
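
The intuition that a random combination of the data points aligns with their strongest direction can be checked numerically. The sketch below builds a synthetic $2 \times 1000$ point cloud that is stretched along one direction (the stretch factor of 20 and the number of repetitions are arbitrary assumptions) and measures the cosine between $z = X \omega$ and the top left singular vector of $X$ over many random draws.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)    # dominant direction (assumed)
orth = np.array([1.0, -1.0]) / np.sqrt(2.0)        # weak orthogonal direction
X = (np.outer(direction, 20.0 * rng.standard_normal(m))
     + np.outer(orth, rng.standard_normal(m)))     # 2 x 1000 data matrix

U, _, _ = np.linalg.svd(X, full_matrices=False)
top = U[:, 0]                                      # strongest direction of the data

cosines = []
for _ in range(200):
    z = X @ rng.standard_normal(m)                 # random combination of the points
    cosines.append(abs(top @ z) / np.linalg.norm(z))
median_cos = float(np.median(cosines))
print(median_cos)
```

A single draw can occasionally miss the dominant direction, which is why the sketch in Lemma 3 uses $r = O(k)$ random combinations rather than exactly $k$.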

Let $X \in \mathbb{R}^{n \times m}$ be a matrix whose columns are the vectors $x_1, \ldots, x_m$. Based on
Lemma 3 we produce a matrix $Q \in \mathbb{R}^{n \times k}$ which satisfies the inequality
$\mathbb{E}[\|Q Q^\top X - X\|_F] \le 2 \|X - X_k\|_F$. Let $\tilde{C} = Q^\top C Q$. Our sketched conditioner is determined by
the matrix $Q$ and the matrix $B = \tilde{C}^{1/2}$. As we show in Algorithm 1 we can compute
a factored form of the inverse of the conditioner, $A^{-1}$, in time $O(mnk)$.

¹See also [23, Lemmas 4.1, 4.2]. In particular, the elements of $\Omega$ can be drawn either as
i.i.d. $\mathcal{N}(0, 1/r)$ or as zero-mean $\pm 1$ random variables. Also, the bounds on the lower dimension in [?] are
better by an (additive) factor of $k \log k$.


Algorithm 1 Sketched Conditioning: Preprocessing

Input: $X \in \mathbb{R}^{n \times m}$. Parameters: $k < n$, $r \in O(k)$

Output: $Q$, $B^{-1}$, $a^{-1}$ that determine a conditioner according to Equation (4)

Sample each element of $\Omega \in \mathbb{R}^{m \times r}$ i.i.d. from $\mathcal{N}(0, 1/r)$

Compute $Z = X \Omega$    # in time $O(mnr)$

$[P, \sim] = \mathrm{QR}(Z)$    # in time $O(r^2 n)$

Compute $Y = P^\top X$    # in time $O(mnr)$

Compute the SVD: $Y = U' \Sigma' V'^\top$    # in time $O(m r^2)$

Compute $Q = P U'_k$    # in time $O(nrk)$

Compute $\tilde{C} = Q^\top C Q$    # in time $O(mkn)$

Compute $B^{-1} = \tilde{C}^{-1/2}$    # in time $O(k^3)$

Compute $a^{-1} = \sqrt{(n - k) / (\operatorname{tr}(C) - \operatorname{tr}(\tilde{C}))}$    # in time $O(mn + k)$


We turn to discuss the performance of this conditioner. Relying on Lemma 3 we start by relating
the trace of $\tilde{C} = Q^\top C Q$ to the trace of $C$.

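Lemma 4. We have $\operatorname{tr}(C) - \operatorname{tr}(\tilde{C}) \le 4 (\operatorname{tr}(C) - \operatorname{tr}(C_k))$.

The next lemma holds for any choice of $Q \in \mathbb{R}^{n \times k}$ with orthonormal columns.

Lemma 5. Assume that $C$ is of rank at least $k$. Let $Q \in \mathbb{R}^{n \times k}$ with orthonormal
columns and define $\tilde{C} = Q^\top C Q$, $B = \tilde{C}^{1/2}$. Then, $\operatorname{tr}(B) = \operatorname{tr}(B^{-1} \tilde{C}) =
O(\operatorname{tr}(C_k^{1/2}))$.

Combining the last two lemmas with Theorem 2 we conclude:

Theorem 4. Consider running SCSGD with the conditioner given in Algorithm 1.
Then,

$$\mathbb{E}[L(\bar{W}) - L(W^*)] \le O\left( \frac{\sigma \rho}{\sqrt{T}} \cdot \left( \operatorname{tr}(C_k^{1/2}) + \sqrt{(n - k)(\operatorname{tr}(C) - \operatorname{tr}(C_k))} \right) \right) .$$

In particular, if the fast spectral decay property holds, i.e., $\sqrt{(n - k)(\operatorname{tr}(C) - \operatorname{tr}(C_k))} =
O(\operatorname{tr}(C^{1/2}))$, then the obtained bound is of the same order as the bound in Theorem 1.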
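
The preprocessing of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the algorithm rather than the authors' implementation: the function name `sketched_preprocessing`, the synthetic data, and the choice $r = 2k$ are assumptions, and the value of $a$ is the one that balances the two $a$-dependent terms $a(n-k)$ and $a^{-1}(\operatorname{tr}(C) - \operatorname{tr}(\tilde{C}))$ in the analysis. The code assumes $\tilde{C}$ is positive definite and that $\operatorname{tr}(C) > \operatorname{tr}(\tilde{C})$.

```python
import numpy as np

def sketched_preprocessing(X, k, r, rng):
    """Return Q, B^{-1}, a^{-1} defining A^{-1} = Q B^{-1} Q^T + a^{-1} (I - Q Q^T)."""
    n, m = X.shape
    Omega = rng.normal(0.0, np.sqrt(1.0 / r), size=(m, r))  # i.i.d. N(0, 1/r) entries
    Z = X @ Omega                                  # n x r sketch of the column space
    P, _ = np.linalg.qr(Z)                         # orthonormal basis of range(Z)
    Y = P.T @ X                                    # r x m projected data
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    Q = P @ U[:, :k]                               # n x k with orthonormal columns
    QtX = Q.T @ X
    C_tilde = QtX @ QtX.T / m                      # Q^T C Q without forming C
    lam, V = np.linalg.eigh(C_tilde)
    B_inv = (V / np.sqrt(lam)) @ V.T               # C_tilde^{-1/2}
    tr_C = np.sum(X * X) / m                       # tr(C) in O(mn) time
    a_inv = np.sqrt((n - k) / (tr_C - np.trace(C_tilde)))
    return Q, B_inv, a_inv

rng = np.random.default_rng(0)
n, m, k, r = 30, 200, 5, 10
X = rng.standard_normal((n, m)) * (0.7 ** np.arange(n))[:, None]
Q, B_inv, a_inv = sketched_preprocessing(X, k, r, rng)
```

The matrix $C$ is never formed explicitly; both $\tilde{C}$ and $\operatorname{tr}(C)$ are computed from $X$ directly, which keeps the cost at $O(mnr)$ overall.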


3 Experiments with Deep Learning 

While our theoretical guarantees were derived for convex problems, the conditioning
technique can be adapted for deep learning problems, as we outline below.

A feedforward deep neural network is a function $f$ that can be written as a
composition $f = f_1 \circ f_2 \circ \ldots \circ f_q$, where each $f_i$ is called a layer function. Some of
the layer functions are predefined, while other layer functions are parameterized by
weight matrices. Training of a network amounts to optimizing w.r.t. the weight
matrices. The most popular layer function with weights is the affine layer (a.k.a. “fully
connected” layer). This layer performs the transformation $y = W x + b$, where $x \in \mathbb{R}^n$,
$W \in \mathbb{R}^{p \times n}$, and $b \in \mathbb{R}^p$. The network is usually trained based on variants of stochastic
gradient descent, where the gradient of the objective w.r.t. $W$ is calculated based on
the backpropagation algorithm, and has the form $\delta x^\top$ where $\delta \in \mathbb{R}^p$.

To apply conditioning to an affine layer, instead of the vanilla SGD update $W =
W - \eta \delta x^\top$, we can apply a conditioned update of the form $W = W - \eta \delta (A^{-1} x)^\top$. To
calculate $A$ we could go over the entire training data and calculate $C = \frac{1}{m} \sum_{i} x_i x_i^\top$.
However, unlike the convex case, now the vectors $x_i$ are not constant but depend on
the weights of previous layers. Therefore, we initialize $C = I$ and update it according to
the update rule $C = (1 - \nu) C + \nu x_i x_i^\top$ for some $\nu \in (0, 1)$. From time to time, we
replace the conditioner to be $A = C^{1/2}$ for the current value of $C$. In our experiments,
we updated the conditioning matrix after each 50s iterations. Note that the process of
calculating $A = C^{1/2}$ can be performed in a different thread, in parallel to the main
stochastic gradient descent process, and therefore it causes no slowdown to the main
stochastic gradient descent process.
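
The running estimate of $C$ and the periodic refresh of the conditioner for a single affine layer can be sketched as follows. The class name `RunningConditioner`, the default values of `nu` and `refresh_every`, and the synchronous refresh (instead of a separate thread) are assumptions made for illustration; the sketch uses a full conditioner, storing $A^{-1} = C^{-1/2}$.

```python
import numpy as np

class RunningConditioner:
    """Tracks C = (1 - nu) C + nu x x^T and periodically refreshes A^{-1} = C^{-1/2}."""

    def __init__(self, n, nu=0.01, refresh_every=50):
        self.C = np.eye(n)                 # initialized to the identity, as in the text
        self.A_inv = np.eye(n)             # identity conditioner until the first refresh
        self.nu = nu
        self.refresh_every = refresh_every
        self.t = 0

    def observe(self, x):
        """Update the running correlation estimate with the layer input x."""
        self.C = (1.0 - self.nu) * self.C + self.nu * np.outer(x, x)
        self.t += 1
        if self.t % self.refresh_every == 0:
            lam, U = np.linalg.eigh(self.C)            # C stays positive definite
            self.A_inv = (U / np.sqrt(lam)) @ U.T      # C^{-1/2}

    def step(self, W, delta, x, eta):
        """Conditioned update W <- W - eta * delta * (A^{-1} x)^T."""
        return W - eta * np.outer(delta, self.A_inv @ x)

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
cond = RunningConditioner(n=4, nu=0.1, refresh_every=5)
W = np.zeros((2, 4))
for _ in range(10):
    x = rng.standard_normal(4)
    delta = rng.standard_normal(2)         # stand-in for the backpropagated signal
    cond.observe(x)
    W = cond.step(W, delta, x, eta=0.01)
```

Because $C$ starts at the identity and every update adds a positive semidefinite term, the eigenvalues passed to the square root stay strictly positive.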

The same technique can be applied to convolutional layers (that also have weights),
because it is possible to write a convolutional layer as a composition of a transformation
called “Im2Col” and a vanilla affine layer. Besides these changes, the rest of the
algorithm is the same as in the convex case.

Below we describe two experiments in which we have applied the conditioning
technique to a popular variant of stochastic gradient descent. In particular, we used stochastic
gradient descent with a mini-batch of size 64, a learning rate of $\eta_t = 0.01 (1 +
0.0001 t)^{-3/4}$, and with Nesterov momentum with parameter 0.9, as described in [20].
To initialize the weights we used the so-called Xavier method, namely, we chose each element
of $W$ at random according to a uniform distribution over $[-a, a]$, with $a = \sqrt{3/n}$.
We chose these parameters because they are the default in the popular Caffe library
(http://caffe.berkeleyvision.org), without attempting to tune them. We
conducted experiments with the MNIST dataset [?] and with the Street View House
Numbers (SVHN) dataset [12].

MNIST: We used a variant of the LeNet architecture [?]. The input is images of
$28 \times 28$ pixels. We apply the following layer functions: Convolution with kernel size
of $5 \times 5$, without padding, and with 20 output channels. Max-pooling with kernel
size of $2 \times 2$. Again, a convolutional layer and a pooling layer with the same kernel sizes,
this time with 50 output channels. Finally, an affine layer with 500 output channels,
followed by a ReLU layer and another affine layer with 10 output channels that forms
the prediction. In short, the architecture is: conv 5x5x20, maxpool 2x2, conv 5x5x50,
maxpool 2x2, affine 500, relu, affine 10.

For training, we used the multiclass log loss function. Figure 2 and Figure 3 show
the training and the test errors both w.r.t. the multiclass log loss function and the zero-one
loss (where the x-axis corresponds to the number of iterations). In both graphs, we
can see that SCSGD enjoys a much faster convergence rate.
Figure 2: MNIST data. Train (left) and test (right) errors w.r.t. the multiclass log loss
of SGD and SCSGD



Figure 3: MNIST data. Train (left) and test (right) errors w.r.t. the zero-one loss of
SGD and SCSGD



Figure 4: SVHN data. Train (left) and test (right) errors w.r.t. the multiclass log loss of
SGD and SCSGD



Figure 5: SVHN data. Train (left) and test (right) errors w.r.t. the zero-one loss of SGD
and SCSGD


SVHN: In this experiment we used a much smaller network. The input is images of
size $32 \times 32$ pixels. Using the same terminology as above, the architecture is now: conv
5x5x8, relu, conv 5x5x16, maxpool 2x2, conv 5x5x16, maxpool 2x2, affine 32, relu,
affine 32, relu, affine 10, avgpool 4x4. The results are summarized in the graphs of
Figure 4 and Figure 5. We again see a superior convergence rate of SCSGD relative
to SGD.
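A Proofs Omitted from the Text

Proof (of Lemma 1). Using the notation from Section 1, denote

$$\Delta_t = \frac{1}{2} D_A(W^*, W_t) - \frac{1}{2} D_A(W^*, W_{t+1}) .$$

As in the standard proof of SGD (for the Lipschitz case), we consider the progress
of $W_t$ towards $W^*$. Recall that $W_{t+1} = W_t - \eta \nabla \ell_{y_{i_t}}(W_t x_{i_t}) x_{i_t}^\top A^{-1}$, where $i_t \in
[m]$ is the random index that is drawn at time $t$. For simplicity, denote the $p \times n$
matrix $\nabla \ell_{y_{i_t}}(W_t x_{i_t}) x_{i_t}^\top$ by $G_t$. Thus, $W_{t+1} - W_t = -\eta G_t A^{-1}$. By standard algebraic
manipulations we have

$$\Delta_t = \frac{1}{2} \operatorname{tr}(A (W^* - W_t)^\top (W^* - W_t)) - \frac{1}{2} \operatorname{tr}(A (W^* - W_{t+1})^\top (W^* - W_{t+1}))$$

$$= \operatorname{tr}((W^* - W_{t+1}) A (W_{t+1} - W_t)^\top) + \frac{1}{2} \operatorname{tr}((W_{t+1} - W_t) A (W_{t+1} - W_t)^\top)$$

$$= \operatorname{tr}((W^* - W_{t+1}) A (-\eta G_t A^{-1})^\top) + \frac{1}{2} \operatorname{tr}((\eta G_t A^{-1}) A (\eta G_t A^{-1})^\top)$$

$$= \eta \cdot \operatorname{tr}((W_{t+1} - W^*) G_t^\top) + \frac{\eta^2}{2} \operatorname{tr}(G_t A^{-1} G_t^\top)$$

$$= \eta \cdot \operatorname{tr}(G_t^\top (W_{t+1} - W^*)) + \frac{\eta^2}{2} \operatorname{tr}(G_t A^{-1} G_t^\top)$$

$$= \eta \cdot \operatorname{tr}(G_t^\top (W_t - W^*)) + \eta \cdot \operatorname{tr}(G_t^\top (W_{t+1} - W_t)) + \frac{\eta^2}{2} \operatorname{tr}(G_t A^{-1} G_t^\top)$$

$$= \eta \cdot \operatorname{tr}(G_t^\top (W_t - W^*)) - \eta^2 \cdot \operatorname{tr}(G_t A^{-1} G_t^\top) + \frac{\eta^2}{2} \operatorname{tr}(G_t A^{-1} G_t^\top)$$

$$= \eta \cdot \operatorname{tr}(G_t^\top (W_t - W^*)) - \frac{\eta^2}{2} \operatorname{tr}(G_t A^{-1} G_t^\top)$$

$$\ge \eta \langle G_t, W_t - W^* \rangle - \frac{\eta^2 \rho^2}{2} \operatorname{tr}(x_{i_t}^\top A^{-1} x_{i_t}) ,$$

where in the last inequality we used the fact that the loss function is $\rho$-Lipschitz. Summing
over $t$ (the sum of the $\Delta_t$ telescopes and is at most $\frac{1}{2} D_A(W^*, W_1)$) and dividing by $\eta$ we obtain

$$\sum_{t=1}^{T} \langle G_t, W_t - W^* \rangle \le \frac{1}{2 \eta} \operatorname{tr}(A (W^* - W_1)^\top (W^* - W_1)) + \frac{\eta \rho^2}{2} \operatorname{tr}\left( A^{-1} \sum_{t} x_{i_t} x_{i_t}^\top \right) .$$

Recall that $W_1 = 0$. Note that the expected value of $G_t$ is the gradient of $L$ at $W_t$ and
the expected value of $x_{i_t} x_{i_t}^\top$ is $C$. Taking expectation over the choice of $i_t$ for all $t$,
dividing by $T$ and relying on the fact that $L(W_t) - L(W^*) \le \langle \nabla L(W_t), W_t - W^* \rangle$
(together with the convexity of $L$, which gives $L(\bar{W}) \le \frac{1}{T} \sum_t L(W_t)$),
we obtain

$$\mathbb{E}[L(\bar{W})] - L(W^*) \le \frac{1}{2 \eta T} \operatorname{tr}(A W^{*\top} W^*) + \frac{\eta \rho^2}{2} \operatorname{tr}(A^{-1} C) .$$

□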
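
The key identity in the proof above, $\Delta_t = \eta \langle G_t, W_t - W^* \rangle - \frac{\eta^2}{2} \operatorname{tr}(G_t A^{-1} G_t^\top)$, can be verified numerically on random matrices. The helper name `D_A`, the dimensions, and the random instance are assumptions for illustration only.

```python
import numpy as np

def D_A(A, W, V):
    """D_A(W, V) = <A, (W - V)^T (W - V)>."""
    diff = W - V
    return np.trace(A @ diff.T @ diff)

rng = np.random.default_rng(0)
p, n, eta = 3, 6, 0.05
E = rng.standard_normal((n, n))
A = E @ E.T + np.eye(n)                            # random positive definite conditioner
A_inv = np.linalg.inv(A)
W_star = rng.standard_normal((p, n))
W_t = rng.standard_normal((p, n))
G = np.outer(rng.standard_normal(p), rng.standard_normal(n))  # rank-one gradient

W_next = W_t - eta * G @ A_inv                     # conditioned update
lhs = 0.5 * D_A(A, W_star, W_t) - 0.5 * D_A(A, W_star, W_next)
rhs = eta * np.sum(G * (W_t - W_star)) - 0.5 * eta**2 * np.trace(G @ A_inv @ G.T)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point error, which is the algebraic step that the Lipschitz bound is then applied to.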
Proof (of Theorem 1). For simplicity, we assume that $C$ has full rank. If this is not
the case, one can add a tiny amount of noise to the instances to make sure that $C$ is of
full rank.

We would like to optimize $\operatorname{tr}(A) + \operatorname{tr}(A^{-1} C)$ over all positive definite matrices.
Since every matrix $A \succ 0$ can be written as $A = \tau M$, where $M \succ 0$, $\operatorname{tr}(M) = 1$ and
$\tau = \operatorname{tr}(A)$, an equivalent objective is given by

$$\min_{\tau > 0} \; \min_{M \succ 0 :\, \operatorname{tr}(M) = 1} \; \frac{\sigma^2 \tau}{2 \eta T} + \frac{\eta \rho^2}{2 \tau} \operatorname{tr}(M^{-1} C) . \qquad (6)$$

The following lemma characterizes the optimizer.

Lemma 6. Let $C \succ 0$. Then,

$$\min_{M \succ 0 :\, \operatorname{tr}(M) \le 1} \operatorname{tr}(M^{-1} C) = (\operatorname{tr}(C^{1/2}))^2 ,$$

and the minimum is attained by $M^* = (\operatorname{tr}(C^{1/2}))^{-1} \cdot C^{1/2}$.

Straightforward optimization over $\tau$ (with $\eta = \sigma / (\rho \sqrt{T})$) yields the value $\tau = \operatorname{tr}(C^{1/2})$. Substituting $\tau$
and $M$ in Equation (6) and applying Lemma 1 we conclude the proof of Theorem 1.

□

Proof (of Lemma 6). First, it can be seen that $M^*$ is feasible and attains the claimed
minimal value. We complete the proof by showing the following inequality for any
feasible $M$:

$$\operatorname{tr}(M^{-1} C) \ge (\operatorname{tr}(C^{1/2}))^2 .$$

We claim the following analogue of Fan’s inequality: For any symmetric matrix $M \in
\mathbb{R}^{n \times n}$,

$$\operatorname{tr}(M^{-1} C) \ge \langle \lambda^{\downarrow}(M^{-1}), \lambda^{\uparrow}(C) \rangle = \langle (\lambda^{\uparrow}(M))^{-1}, \lambda^{\uparrow}(C) \rangle ,$$

where $\downarrow$ and $\uparrow$ are used to represent decreasing and increasing orders, respectively, and
for a vector $x = (x_1, \ldots, x_n)$ with positive components, $x^{-1} = (1/x_1, \ldots, 1/x_n)$.
The equality is clear so we proceed by proving the inequality. Let $M$ be an $n \times n$
symmetric matrix. Assume that $C = \sum_{i=1}^{n} \lambda_i u_i u_i^\top$ and $M = \sum_{i=1}^{n} \mu_i v_i v_i^\top$ are the
spectral decompositions of $C$ and $M$, respectively. Letting $\alpha_{ij} = \langle u_i, v_j \rangle$, we have

$$\operatorname{tr}(M^{-1} C) = \sum_{i,j} \frac{\lambda_i}{\mu_j} \alpha_{ij}^2 .$$

Note that since both $v_1, \ldots, v_n$ and $u_1, \ldots, u_n$ form orthonormal bases, the matrix
$Z \in \mathbb{R}^{n \times n}$ whose $(i, j)$-th element is $\alpha_{ij}^2$ is doubly stochastic. So, we have

$$\operatorname{tr}(M^{-1} C) = \lambda^\top Z \mu^{-1} .$$

Viewing the right side as a function of $Z$, we can apply Birkhoff’s theorem and
conclude that the minimum is obtained by a permutation matrix. The claimed inequality
follows.
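Thus, we next consider the objective

$$\min_{\mu \in E} \sum_{i=1}^{n} \lambda_i / \mu_i ,$$

where $E = \{ \mu \in \mathbb{R}^n : \mu > 0, \; \sum_{i=1}^{n} \mu_i \le 1 \}$. The corresponding Lagrangian² is

$$L(\mu; \alpha) = \sum_{i=1}^{n} \lambda_i / \mu_i - \sum_{i=1}^{n} \alpha_i \mu_i + \alpha_{n+1} \left( \sum_{i=1}^{n} \mu_i - 1 \right) .$$

Next, we compare the differential to zero and rearrange, to obtain

$$(\lambda_i / \mu_i^2)_{i=1}^{n} = (\alpha_{n+1} - \alpha_i)_{i=1}^{n} .$$

By complementary slackness, $\alpha_1 = \ldots = \alpha_n = 0$. Thus,

$$\mu_i = c \lambda_i^{1/2}$$

for some $c > 0$. The constraint $\sum_{i=1}^{n} \mu_i \le 1$ implies that $c = \left( \sum_{i=1}^{n} \lambda_i^{1/2} \right)^{-1}$. Substituting
the minimizer in the objective, we conclude the proof. □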

Proof (of Lemma 2). Since $B = E E^\top$ for some matrix $E$, it follows that $Q B Q^\top =
Q E (Q E)^\top$, thus it is positive semidefinite. The matrix $a(I - Q Q^\top)$ is clearly positive
semidefinite. It remains to show that $A$ is invertible and thus it is positive definite. We
have

$$A A^{-1} = (Q B Q^\top + a(I - Q Q^\top))(Q B^{-1} Q^\top + a^{-1}(I - Q Q^\top))$$

$$= Q B Q^\top Q B^{-1} Q^\top + (I - Q Q^\top)(I - Q Q^\top) + 0 + 0$$

$$= Q Q^\top + I - Q Q^\top$$

$$= I .$$

□ 


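Proof (of Theorem 2). Recall that

$$A = Q B Q^\top + a(I - Q Q^\top) .$$

According to Lemma 1 we need to show that

$$\operatorname{tr}(A) + \operatorname{tr}(A^{-1} C) \le \operatorname{tr}(B) + \operatorname{tr}(B^{-1} \tilde{C}) + 2 \sqrt{(n - k)(\operatorname{tr}(C) - \operatorname{tr}(\tilde{C}))} .$$

Since the trace is invariant to cyclic permutations, we have

$$\operatorname{tr}(A) = \operatorname{tr}(Q B Q^\top) + a \cdot \operatorname{tr}(I - Q Q^\top)$$

$$= \operatorname{tr}(Q^\top Q B) + a(n - k)$$

$$= \operatorname{tr}(B) + a(n - k) .$$

Using Lemma 2 we obtain

$$\operatorname{tr}(A^{-1} C) = \operatorname{tr}(Q B^{-1} Q^\top C) + a^{-1} \cdot \operatorname{tr}((I - Q Q^\top) C)$$

$$= \operatorname{tr}(B^{-1} Q^\top C Q) + a^{-1} (\operatorname{tr}(C) - \operatorname{tr}(Q Q^\top C))$$

$$= \operatorname{tr}(B^{-1} \tilde{C}) + a^{-1} (\operatorname{tr}(C) - \operatorname{tr}(\tilde{C})) .$$

Substituting $a = \sqrt{(\operatorname{tr}(C) - \operatorname{tr}(\tilde{C})) / (n - k)}$ we complete the proof. □

²The strict inequalities $\mu_i > 0$ are not allowed, but we can replace them with weak inequalities and let
$f(\mu) = \infty$ for any $\mu$ one of whose components is not greater than zero.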

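Proof (of Theorem 3). Note that

$$\tilde{C} = Q^\top C Q = U_k^\top U D U^\top U_k = D_k ,$$

$$B = \tilde{C}^{1/2} = D_k^{1/2} ,$$

$$B^{-1} \tilde{C} = \tilde{C}^{-1/2} \tilde{C} = \tilde{C}^{1/2} = D_k^{1/2} .$$

In particular, $\operatorname{tr}(B) = \operatorname{tr}(B^{-1} \tilde{C}) = \operatorname{tr}(C_k^{1/2})$ and $\operatorname{tr}(\tilde{C}) = \operatorname{tr}(C_k)$.
Invoking Theorem 2 we obtain the desired bound. □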

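Proof (of Lemma 4). With a slight abuse of notation, we consider the decomposition
$C = C_k + C_{n-k}$ (here, $C_{n-k}$ corresponds to the last $n - k$ eigenvalues rather than to
the first $n - k$ eigenvalues). We need to show that

$$\operatorname{tr}(\tilde{C}) \ge \operatorname{tr}(C_k) - 3 \operatorname{tr}(C_{n-k}) .$$

Let $\bar{X} = \frac{1}{\sqrt{m}} X$, where $X \in \mathbb{R}^{n \times m}$ is the matrix whose columns are $x_1, \ldots, x_m$.
Note that $C = \bar{X} \bar{X}^\top$. Also, since $Q$ satisfies Equation (5) w.r.t. $X$, it satisfies the
same inequality w.r.t. $\bar{X}$. Let $\bar{X} = U \Sigma V^\top$ be the SVD of $\bar{X}$. Note that the same
matrix $U$ participates in the SVD (EVD) of the matrix $C = \bar{X} \bar{X}^\top$, i.e., we have
$C = U D U^\top$, where $D = \Sigma^2$. Recall that the best rank-$k$ approximation of $\bar{X}$ is
$U_k U_k^\top \bar{X} = U_k \Sigma_k V_k^\top$. By assumption,

$$\|Q Q^\top \bar{X} - \bar{X}\|_F^2 \le 4 \|U_k U_k^\top \bar{X} - \bar{X}\|_F^2 . \qquad (7)$$

Note that

$$\|\bar{X} - Q Q^\top \bar{X}\|_F^2 = \operatorname{tr}(\bar{X}^\top \bar{X}) + \operatorname{tr}(\bar{X}^\top Q Q^\top Q Q^\top \bar{X}) - 2 \operatorname{tr}(\bar{X}^\top Q Q^\top \bar{X})$$

$$= \operatorname{tr}(C) - \operatorname{tr}(Q Q^\top C) = \operatorname{tr}(C) - \operatorname{tr}(\tilde{C}) .$$

Similarly,

$$\|U_k U_k^\top \bar{X} - \bar{X}\|_F^2 = \operatorname{tr}(C) - \operatorname{tr}(U_k U_k^\top C) = \operatorname{tr}(C) - \operatorname{tr}(C_k) .$$

Thus, Equation (7) implies that

$$\operatorname{tr}(C) - \operatorname{tr}(\tilde{C}) \le 4 (\operatorname{tr}(C) - \operatorname{tr}(C_k)) .$$

Hence,

$$\operatorname{tr}(\tilde{C}) \ge 4 \operatorname{tr}(C_k) - 3 \operatorname{tr}(C) = \operatorname{tr}(C_k) - 3 \operatorname{tr}(C_{n-k}) .$$

□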
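Proof (of Lemma 5). First, we note that $\operatorname{tr}(B) = \operatorname{tr}(B^{-1} \tilde{C}) = \operatorname{tr}(\tilde{C}^{1/2})$. Thus, we need to show
that $\operatorname{tr}(\tilde{C}^{1/2}) = O(\operatorname{tr}(C_k^{1/2}))$. Second, we observe that for every positive scalar $b$, we
have

$$\operatorname{tr}(\tilde{C}^{1/2}) = O(\operatorname{tr}(C_k^{1/2})) \iff \operatorname{tr}((b \tilde{C})^{1/2}) = O(\operatorname{tr}((b C_k)^{1/2})) .$$

Denote the $k$ top eigenvalues of $C$ and $\tilde{C}$ by $\lambda_1, \ldots, \lambda_k$ and $\tilde{\lambda}_1, \ldots, \tilde{\lambda}_k$, respectively.
According to the above observation, we may assume w.l.o.g. that $\lambda_i \ge 1$ for all $i \in [k]$
(simply consider $b = \lambda_k^{-1}$).

Let $\bar{X} = U \Sigma V^\top$ be the SVD of $\bar{X}$, where $\bar{X} = (1/\sqrt{m}) X$. Since $U_k U_k^\top \bar{X}$ is the
best rank-$k$ approximation of $\bar{X}$, we have

$$\|U_k U_k^\top \bar{X} - \bar{X}\|_F^2 \le \|Q Q^\top \bar{X} - \bar{X}\|_F^2$$

for all $Q \in \mathbb{R}^{n \times k}$ with orthonormal columns. As in the proof of Lemma 4, this implies
that

$$\operatorname{tr}(C_k) \ge \operatorname{tr}(\tilde{C}) .$$

Therefore,

$$\operatorname{tr}(\tilde{C}^{1/2}) - \operatorname{tr}(C_k^{1/2}) = \sum_{i=1}^{k} \left( \sqrt{\tilde{\lambda}_i} - \sqrt{\lambda_i} \right) = \sum_{i=1}^{k} \frac{\tilde{\lambda}_i - \lambda_i}{\sqrt{\tilde{\lambda}_i} + \sqrt{\lambda_i}}$$

$$\le \sum_{i=1}^{k} (\tilde{\lambda}_i - \lambda_i)$$

$$= \operatorname{tr}(\tilde{C}) - \operatorname{tr}(C_k) \le 0 ,$$

where the first inequality follows from the assumption that $\lambda_i \ge 1$ for all $i \in [k]$. □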